Prepared by Rafee Shaik, rafees@umich.edu
github repository: https://github.com/rafeesumich/Oklahoma-Education-Graduation-Rate-Analysis
This work is a study of Oklahoma state graduation rates by ethnic groups in the year 2017. You can get the yearly graduation Rate data from Oklahoma state education department website. Here is the link: https://sde.ok.gov/sde/sites/ok.gov.sde/files/documents/files/2017%20Graduation%20Rate.xlsx
Dataset: This dataset contains graduation rates at three levels.
Procedure:
I'm going to use mainly three visualization techniques for this study: I'm going to do a univariate analysis, best visualization techniques for univariate analysis are Boxplots, Violinplots, and Histograms
Box plots:
Box plots are a good starting point for univariate numerical analysis,
Which means it analyzes single set of data.
It summarizes this data into Quartiles: Q1(25%), Q2(50%) and Q3(75%).
And show us inter quartile ranges (IQR) in a box, outliers will also be shown. If needed it can also show us mean as well.
When to use it: To get a quick summary of single variable.
Limitations: Box plots cant show distribution of data within a single dataset, Consider violin plot for more detailed study of univariate analysis
Violin plots:
Violin plots are best to analyze single variate numerical data. Violin plots can show us distribution of underlying dataset through kernel density function. The distribution can help us identify different categories of data within a single variable. Violin plots will also show us outliers, IQR (Inter Quartile Range), and mean when needed.
each side of the violinplot can be plotted using different categories of data within same dataset.
Limitations: Use a histogram for full univariate analysis. Even though Violin plot has Kernel Density function builtin, the control over the visualization is limited.
When to use: To look at the underlying data distribution within a single variable.
Histogram: Histograms are another univariate analysis visualization technique. Histograms are best fit to understand the distribution of data. It divides the variable value range into equally sized bins and calculates the frequency of the variable within each bin. Histograms are great fit for probability distribution of continuous variable.
Other visualization techniques used in this analysis:
Overview:
Main visualization library used in this analyis is seaborn.
Seaborn is good fit for my analysis in this paper. It provides abstraction layer on top of matplotlib,It produces attractive and informative statistical graphics, has good integration with pandas library. Seaborn visualization library is created on top of Matplotlib. All the matplotlib kwargs will work with seaborn. Its a opensource, 'free to use' library.
Creater of this library: Michael Waskom
How to install it:
from anaconda prompt: conda install seaborn
from python terminal: pip install seaborn
Limitation: Lack of support for interactivity
Here are the three packages you need for this analysis those are not installed on coursera jupyter notebooks
pip install watermark
pip install geopandas
pip install descartes
%load_ext watermark
%watermark -v -m -p pandas,seaborn,matplotlib,numpy,geopandas,warnings
Rule 1: Tell a story for an audience
Provided a good overview of this analysis in the overview section
Rule 2: Document the process, not just the results
I documented the entire process starting the download the data, cleaning it and analyzing it
Rule 3: Use cell divisions to make steps clear
Prepared code in individula code blocks and placed them in single jupyter cell
Rule 4: Modularize code
Code is modularized by analysis type, Here my analysis is divided into three parts
Rule 5: Record dependencies
Dependencies are listed there using watermark package
Rule 6: Use version control
Code and Data has been backed up and commited on Github
Rule 7: Build a pipeline
Rule 8: Share and explain your data
Code and Data has been made public on Github
Rule 9: Design your notebooks to be read, run, and explored
Organized the notebook into different sections so that it can be easily read, run and explored
Rule 10: Advocate for open research
There are sections in this notebook that gives opportunity for further reasearch and analysis
#Import required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
#Supress warnings
import warnings
warnings.filterwarnings('ignore')
#Get the dataset
#df_orig=pd.read_excel('https://sde.ok.gov/sde/sites/ok.gov.sde/files/documents/files/2017%20Graduation%20Rate.xlsx')
#I saved a copy of this file in github, you can use it if you're not able to find it on sde.ok.gov website
df_orig=pd.read_excel('2017 Graduation Rate.xlsx')
This dataset is created by state of Oklahoma every year. It contains county name, school district name, student graduation rate summerized by ethnicity at multiple levels of schooling: PK-to-8th grade, and 9th-to-12th. We used most recent data that is avialble for public from state of oklahoma. Elementry and Middle school gradualtion data is summerized at District level.
#Make a copy of original dataset.
df_state=df_orig.copy(deep=True)
#filter state level data only
df_state=df_state[df_state['EducationAgencyType']=='State']
#Data cleaning:
#remove percentage symboles and other panctuations from numerical columns and convert them into Pandas numerical type
#We are also going to sort the data by value to easily identify the most and least performing ethnic groups.
df_state['FourYearGraduationRate']=pd.to_numeric(df_state['FourYearGraduationRate'].str.replace('%',''))
df_state = df_state.sort_values('FourYearGraduationRate',ascending=False).reset_index(drop=True)
#Take a look at first few records in the dataframe
df_state.head()
#Prepare a bar chart of this data, on x-axis mark the ethnicity, and on y-axis plot the graduation rate of each ethinicity.
#Set the figure size to 12 by 8 inches, this is matplotlib property.
plt.figure(figsize=(12,8))
#Use seaborn barplot method to plot the Bar chart. Most of the seabon artist methods return axes that is same as in Matplotlib.
ax = sns.barplot(x=df_state['ReportSubgroup'],y=df_state['FourYearGraduationRate']
,order=df_state['ReportSubgroup']
)
# Use the axes object to set the X axis label and Y labels and Figure title
ax.set(xlabel='Ethnicity',ylabel='Granduation Rate',title='State level graduation rate in different ethnic groups')
#Since we hive long names for each bar we need to rotate them to 90 degree so that we can read them properly
for item in ax.get_xticklabels(): item.set_rotation(90)
#Also I'm going to put bar height at the top of the bar. rotated to 45 degree, plot this height as text. color it in magenta
for i, v in enumerate(df_state['FourYearGraduationRate'].iteritems()):
ax.text(i ,v[1], "{:,}".format(v[1]), color='m', va ='bottom', rotation=45)
From this graph you can clearly see Asians are most performing and English-Language-Learner are leat performing
#Prepare School and District level ETHINIC based dataframes
#Select only required columns
columns = ['CohortGraduationYear', 'EducationAgencyType',
'Districtcode', 'CountyName', 'DistrictName',
'GradeLevelLow', 'GradeLevelHigh', 'ReportSubgroup',
'FourYearGraduationRate']
df=df_orig[columns]
#Remove % sign and other markers from numerical data
df['FourYearGraduationRate'] = (df['FourYearGraduationRate'].str.replace('>=','')
.str.replace('%','').replace('***',np.nan).str.replace('<',''))
#We're going to study the data at ethnic level so remove aggregated information rows
#(marked by 'All' in 'ReportSubgroup' column) from data
df=df[df['ReportSubgroup']!='All']
#Remove null value rows
df=df.dropna()
#convert Graduation data to panda's numeric data type for easy analysis.
df['FourYearGraduationRate']=pd.to_numeric(df['FourYearGraduationRate'])
#Seperate this dataset into two sets, one with HighSchool level data and One upto 8th
#Grades '9 and up' dataset - Highschools
df_school= df[df['EducationAgencyType']=='School']
#Grades 'PK to 8th' dataset
df_district= df[df['EducationAgencyType']=='District']
#df[df['EducationAgencyType']=='District'].head()
df_school.tail()
#School level Analytics
#Set figure size to 16X8 inches
plt.figure(figsize=(16,8))
# On X axis we're going to put Ethinic groups,
#Seaborn will automatically split the data into these groups and give us a category wise boxplots
#and we want to show mean point as well.
ax2= sns.boxplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_school, showmeans=True)
#
for item in ax2.get_xticklabels(): item.set_rotation(90)
ax2.set(xlabel='Ethnicity',ylabel='Granduation Rate')
#Observe how I'm setting the title here, earlier I set the title using axes object along with X and Y axis labels.
#Increase the title side and use color green.
plt.gca().set_title('2017 Highschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
plt.show()
We can clearly see Graduation Rate changed at HighSchool level from state level, at Highschool level, 'Regular Education' group is performing better than 'Asian' group. English Language Learners are sill performing least.
Lets explore how data is distributed within each group, Violin plots are best in this case.
plt.figure(figsize=(16,8))
ax3= sns.violinplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_school,showmeans=True)
for item in ax3.get_xticklabels(): item.set_rotation(90)
plt.gca().set_title('2017 Highschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
ax3.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.show()
# Violinplots have similar syntax compared to boxplots.
#but they will provide significant information within single visualization
# you can also use one side of the plot for one category and other side for another category with in the plot.
# use the 'hue' and 'split' parameters to split the data within the variable.
#Here is example syntax
#ax = sns.violinplot(x="day", y="total_bill", hue="smoker", data=tips, palette="muted", split=True)
The mean of of all ethnic groups is around 80%, except in group 'English Language Learners'. Also the data is skewed toward the first quantile of the set. From this graph we can conclude Graduation Retes are significantly lower in 'English Language Learners'
Objective: At school level analysis (above graph) we concluded that graduation rate in 'English Language Learners' is significanly lower when compared to other groups. We will prepare similar analysis and see if 'English Language Learner' group gradualtion rate is lowest at Elementry-Middle-school level as well.
#District level Analytics with boxplot
plt.figure(figsize=(16,8))
ax4= sns.boxplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_district, showmeans=True)
for item in ax4.get_xticklabels(): item.set_rotation(90)
ax4.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.gca().set_title('2017 Primary and Midschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
plt.show()
# District level graduation rates with Violinplots
plt.figure(figsize=(16,8))
ax5= sns.violinplot(x='ReportSubgroup',y='FourYearGraduationRate', data=df_district,showmeans=True)
for item in ax5.get_xticklabels(): item.set_rotation(90)
plt.gca().set_title('2017 Primary and Midschool Granduation Rates in Oklahoma by ethinicity', fontsize=18,color='g')
ax5.set(xlabel='Ethnicity',ylabel='Granduation Rate')
plt.show()
Graduation rate in 'English Language Learner' is the lowest in both Highschool level and Elementry+midschool levels
Irrespective of ethnicity lets see how graduation rates are distributed within Highschool level and Mid+Elementry School levels
#Remove all the non-numerical charectars from numeric columns
df_all=df_orig[df_orig['ReportSubgroup']=='All']
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('>','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('%','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('<','')
df_all['FourYearGraduationRate']=df_all['FourYearGraduationRate'].str.replace('=','')
df_all=df_all[df_all['FourYearGraduationRate']!='***']
### Divide this data into two datasets, one at HighSchool level and one for Elementry+Mid School level
#Prepare school level and District level data frames
df_all['FourYearGraduationRate']=pd.to_numeric(df_all['FourYearGraduationRate'])
#Grades '9 and up' dataset - Highschools
df_school_all= df_all[df_all['EducationAgencyType']=='School']
#Grades 'PK to 9th' dataset
df_district_all= df_all[df_all['EducationAgencyType']=='District']
Distribution of Highschool graduation rates on Histogram
#Histograms in seaborn can be ploted using the function distplot. Histograms take single dimention data,
#optionally you can provide bin size with parameter 'bins'
#and ask to plot density estimation as well with parameter 'kde=True/False'
#you can also indicate whether you want to plot rugplot or not.
plt.figure(figsize=(14,8))
ax4=sns.distplot(df_school_all['FourYearGraduationRate']
,bins=30
,kde=True
,rug=False
)
ax4.set(ylabel='Density',xlabel='Granduation Rate', title='Highshool level graduation rate distribution')
plt.show()
#Observe the rug plot option usage in this graph
plt.figure(figsize=(14,8))
ax5=sns.distplot(df_district_all['FourYearGraduationRate'], bins=30
,kde=True
,rug=True
)
plt.show()
We will be using geopandas for this analysis to plot a heat map over okalahoma with School level graduation rates
#
#School districts Shape file
#We need Oklahoma county and school district level shape files for this analysis
fp_cousub=r'SchoolMaps/tl_2017_40_cousub.shp'
#Download it from github
#fp_cousub='tl_2017_40_cousub.shp'
#you will read the shape file using geopandas's read_file method.
map_cousub= gpd.read_file(fp_cousub)
#Downsize the school level graduation rates to district level
downsized_school_df = df_school_all[['DistrictName','FourYearGraduationRate']]
downsized_school_df=downsized_school_df.groupby('DistrictName').agg({'FourYearGraduationRate':np.mean})
downsized_school_df=downsized_school_df.reset_index()
#Make sure District names are matching, turn both names to upper case
downsized_school_df.columns = ['NAME', 'FourYearGraduationRate']
downsized_school_df['NAME'] = downsized_school_df['NAME'].str.upper()
map_cousub['NAME'] = map_cousub['NAME'].str.upper()
#Merge shapefile with school level graduation rate dataframe
map_GradScore = map_cousub.merge(downsized_school_df, how='left',on='NAME')
#Plot the Graduation Rate data over Oklahoma school district areas
fig, ax = plt.subplots(1, 1,figsize=(15,15))
map_GradScore.plot(column='FourYearGraduationRate', ax=ax
#,cmap='tab20'
,edgecolor='dimgrey'
,legend=True
,legend_kwds={'label': "Graduation rate legend", 'orientation': "horizontal"})
ax.set_title('Oklahoma 2017 School district level graduation rate',fontsize=20)
ax.set_axis_off()
plt.show()
#Preparing a county level heatmap
fp_county = "CountyMaps/tl_2017_us_county.shp"
#I shared this file on github
map_county_df = gpd.read_file(fp_county)
#Get oklahoma county data
map_df_ok_county=map_county_df[map_county_df['STATEFP']=='40']
# check data type so we can see that this is not a normal dataframe, but a GEOdataframe
#map_df_ok_county.head()
#Further downsample the school graduation data to county level
downsized_county_df = df_school_all[['CountyName','FourYearGraduationRate']]
downsized_county_df=downsized_county_df.groupby('CountyName').agg({'FourYearGraduationRate':np.mean})
downsized_county_df=downsized_county_df.reset_index()
downsized_county_df.columns=['NAME', 'FourYearGraduationRate']
downsized_county_df['NAME']=downsized_county_df['NAME'].str.strip()
map_df_ok_county['NAME']=map_df_ok_county['NAME'].str.upper()
map_df_county_join = map_df_ok_county.merge(downsized_county_df,how='left',on='NAME')
fig, ax = plt.subplots(1, 1,figsize=(15,15))
map_df_county_join.plot(column='FourYearGraduationRate', ax=ax
#,cmap='tab20'
, edgecolor='dimgrey'
,legend=True
,legend_kwds={'label': "Graduation rate legend", 'orientation': "horizontal"})
ax.set_title('Oklahoma 2017 county level graduation rate',fontsize=20)
ax.set_axis_off()
plt.show()
#lets combine the graduation data with county population data
#and see if there is any correlation exists between country population and graduation rates
#download the oklahoma population data from link below
#https://www.oklahoma-demographics.com/counties_by_population
#Since extenal URLs are taking furever to respond I downloaded this datafram to a csv and attached to this assignment
#population = pd.read_html('https://www.oklahoma-demographics.com/counties_by_population')[0]
population = pd.read_csv('population.csv',delimiter='|')
population=population.iloc[:77]
population['Population'] = pd.to_numeric(population['Population'])
population['County'] = population['County'].str.replace(' County','')
population=population.rename(columns={'County':'NAME'})
population['NAME']=population['NAME'].str.upper()
pop_grd_rate_df = population.merge(downsized_county_df, how='left', on ='NAME')
#Seaborn scatterplot dipecting Population and graduation rates at county level
sns.scatterplot( x='Population',y='FourYearGraduationRate',data=pop_grd_rate_df)
#Looks like there are three large counties that are causing our data to be skewed to the right,
#so we're going to filter them out
sns.scatterplot( x='Population',y='FourYearGraduationRate',data=pop_grd_rate_df.iloc[3:])
import plotly.express as px
fig = px.scatter(pop_grd_rate_df.iloc[3:], x='Population',y='FourYearGraduationRate', trendline="ols")
fig.show()
There seem to be a low level positive correlation between county population and graduation rate.
#national_graduationRate_df = pd.read_html('https://www.governing.com/gov-data/education-data/state-high-school-graduation-rates-by-race-ethnicity.html')[0]
national_graduationRate_df = pd.read_csv('national_graduationRate_df.csv',delimiter='|')
#URL is: 'https://www.governing.com/gov-data/education-data/state-high-school-graduation-rates-by-race-ethnicity.html'
#national_graduationRate_df.head()
national_graduationRate_df.plot(kind='box')
One can take the oklahoma graduation rate in different ethnic groups and compare it againest National level graduation data